Hive Joins and Complex Data Type Operations

Posted by Jackson on 2017-09-06

Join Operations in Hive

Inner Join

Inner Join: returns only the rows where the join condition matches.

Left Join: left outer join; returns every row from the left table plus the matching rows from the right table, filling NULL where there is no match.

Right Join: right outer join; returns every row from the right table plus the matching rows from the left table, filling NULL where there is no match.

Full Join: full outer join; returns all rows from both tables, filling NULL where there is no match.
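A minimal sketch of the four join types. The tables t1 and t2 here are hypothetical toy tables, not part of this post's data:

-- Hypothetical toy tables:
create table t1(id int, name string);
create table t2(id int, addr string);
-- t1: (1,'a') (2,'b') (3,'c')    t2: (1,'beijing') (2,'shanghai') (4,'tianjin')

select t1.id, t1.name, t2.addr from t1 join t2 on t1.id = t2.id;              -- inner: ids 1,2
select t1.id, t1.name, t2.addr from t1 left outer join t2 on t1.id = t2.id;   -- ids 1,2,3 (addr is NULL for 3)
select t1.id, t1.name, t2.addr from t1 right outer join t2 on t1.id = t2.id;  -- ids 1,2,4 (name is NULL for 4)
select t1.id, t1.name, t2.addr from t1 full outer join t2 on t1.id = t2.id;   -- ids 1,2,3,4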

HiveServer2

Start the HiveServer2 server, then connect with beeline.

Note: run beeline from Hive's bin directory. Spark ships its own beeline, and when both are on the PATH the one configured later shadows the other, so you can end up in the wrong client.

./beeline -u jdbc:hive2://bigdata01:10000/bigdata_hive -n hadoop

View the HiveServer2 command help:

[hadoop@bigdata01 bin]$ hiveserver2 --help
19/12/19 20:21:18 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
usage: hiveserver2
--deregister <versionNumber> Deregister all instances of given
version from dynamic service discovery
-H,--help Print help information
--hiveconf <property=value> Use value for given property

Changing the HiveServer2 default port

Temporary override on the command line:

hiveserver2 --hiveconf hive.server2.thrift.port=10066

Permanent change in Hive's configuration file hive-site.xml:

hive.server2.thrift.port
Default Value: 10000
Added In: Hive 0.11.0 with HIVE-2935


Hive's Complex Data Types

Array Type

Data:

pk	beijing,shanghai,tianjin,hangzhou
jepson	changchun,chengdu,wuhan,beijing
create table hive_array(
name string,
location array<string>)
row format delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY ',';

load data local inpath '/home/hadoop/data/hive_array.txt' overwrite into table hive_array;
select * from hive_array;
INFO : OK
+------------------+----------------------------------------------+--+
| hive_array.name | hive_array.location |
+------------------+----------------------------------------------+--+
| pk | ["beijing","shanghai","tianjin","hangzhou"] |
| jepson | ["changchun","chengdu","wuhan","beijing"] |
+------------------+----------------------------------------------+--+

0: jdbc:hive2://bigdata01:10000/bigdata_hive> desc hive_array;
+-----------+----------------+----------+--+
| col_name | data_type | comment |
+-----------+----------------+----------+--+
| name | string | |
| location | array<string> | |
+-----------+----------------+----------+--+

Get each employee's first work location:

select name,location[0] from hive_array;
INFO : OK
+---------+-----------+--+
| name | _c1 |
+---------+-----------+--+
| pk | beijing |
| jepson | changchun |
+---------+-----------+--+

Count how many work locations each employee has:

select name, size(location) from hive_array;
INFO : OK
+---------+------+--+
| name | _c1 |
+---------+------+--+
| pk | 4 |
| jepson | 4 |
+---------+------+--+

Find the employees who work in tianjin:

select * from hive_array where array_contains(location,'tianjin');
INFO : OK
+------------------+----------------------------------------------+--+
| hive_array.name | hive_array.location |
+------------------+----------------------------------------------+--+
| pk | ["beijing","shanghai","tianjin","hangzhou"] |
+------------------+----------------------------------------------+--+

Map Type

Data:

1,zhangsan,father:xiaoming#mother:xiaohuang#brother:xiaoxu,28
2,lisi,father:mayun#mother:huangyi#brother:guanyu,22
3,wangwu,father:wangjianlin#mother:ruhua#sister:jingtian,29
4,mayun,father:mayongzhen#mother:angelababy,26

Create table statement:

create table hive_map(
id int,
name string,
members map<string,string>,
age int
) row format delimited fields terminated by ','
collection items terminated by '#'
map keys terminated by ':';

Load the data:

load data local inpath '/home/hadoop/data/hive_map.txt' into table hive_map;
+--------------+----------------+--------------------------------------------------------------------+---------------+
| hive_map.id | hive_map.name | hive_map.members | hive_map.age |
+--------------+----------------+--------------------------------------------------------------------+---------------+
| 1 | zhangsan | {"father":"xiaoming","mother":"xiaohuang","brother":"xiaoxu"} | 28 |
| 2 | lisi | {"father":"mayun","mother":"huangyi","brother":"guanyu"} | 22 |
| 3 | wangwu | {"father":"wangjianlin","mother":"ruhua","sister":"jingtian"} | 29 |
| 4 | mayun | {"father":"mayongzhen","mother":"angelababy"} | 26 |
+--------------+----------------+--------------------------------------------------------------------+---------------+

Requirement: get each person's father and mother:

select name,members['father'] as father,members['mother'] as mother from hive_map;
INFO : OK
+-----------+--------------+-------------+--+
| name | father | mother |
+-----------+--------------+-------------+--+
| zhangsan | xiaoming | xiaohuang |
| lisi | mayun | huangyi |
| wangwu | wangjianlin | ruhua |
| mayun | mayongzhen | angelababy |
+-----------+--------------+-------------+--+

Get each person's relationship types, i.e. the map keys:

select id,name,map_keys(members) as members from hive_map;

INFO : OK
+-----+-----------+--------------------------------+--+
| id | name | members |
+-----+-----------+--------------------------------+--+
| 1 | zhangsan | ["father","mother","brother"] |
| 2 | lisi | ["father","mother","brother"] |
| 3 | wangwu | ["father","mother","sister"] |
| 4 | mayun | ["father","mother"] |
+-----+-----------+--------------------------------+--+

Get the names of each person's relatives, i.e. all the map values:

select id,name,map_values(members) as members from hive_map;

INFO : OK
+-----+-----------+-------------------------------------+--+
| id | name | members |
+-----+-----------+-------------------------------------+--+
| 1 | zhangsan | ["xiaoming","xiaohuang","xiaoxu"] |
| 2 | lisi | ["mayun","huangyi","guanyu"] |
| 3 | wangwu | ["wangjianlin","ruhua","jingtian"] |
| 4 | mayun | ["mayongzhen","angelababy"] |
+-----+-----------+-------------------------------------+--+

Count how many relationships each person has:

select id,name,size(members) from hive_map;

INFO : OK
+-----+-----------+------+--+
| id | name | _c2 |
+-----+-----------+------+--+
| 1 | zhangsan | 3 |
| 2 | lisi | 3 |
| 3 | wangwu | 3 |
| 4 | mayun | 2 |
+-----+-----------+------+--+

Find the people who have a brother, and who that brother is.

Reuse array_contains from the array functions: first turn the map keys into an array with map_keys, then operate on that array.

select id,name,members['brother'] brother from hive_map where array_contains(map_keys(members),'brother');
+-----+-----------+----------+--+
| id | name | brother |
+-----+-----------+----------+--+
| 1 | zhangsan | xiaoxu |
| 2 | lisi | guanyu |
+-----+-----------+----------+--+

Struct Type

The fields inside a struct can have different data types.

structs: STRUCT<col_name : data_type [COMMENT col_comment], …>

Data:

[root@bigdata01 data]# cat hive_struct.txt 
192.168.1.1#zhangsan:40
192.168.1.2#lisi:50
192.168.1.3#wangwu:60
192.168.1.4#zhaoliu:70

Create the table:

create table hive_struct(
id string,
info struct<name:string,age:int>
) row format delimited fields terminated by '#'
collection items terminated by ':' ;

Load the data:

load data local inpath '/home/hadoop/data/hive_struct.txt' into table hive_struct;


INFO : OK
+-----------------+-------------------------------+--+
| hive_struct.id | hive_struct.info |
+-----------------+-------------------------------+--+
| 192.168.1.1 | {"name":"zhangsan","age":40} |
| 192.168.1.2 | {"name":"lisi","age":50} |
| 192.168.1.3 | {"name":"wangwu","age":60} |
| 192.168.1.4 | {"name":"zhaoliu","age":70} |
+-----------------+-------------------------------+--+

Query the data:

select id,info.name,info.age from hive_struct;
INFO : OK
+--------------+-----------+------+--+
| id | name | age |
+--------------+-----------+------+--+
| 192.168.1.1 | zhangsan | 40 |
| 192.168.1.2 | lisi | 50 |
| 192.168.1.3 | wangwu | 60 |
| 192.168.1.4 | zhaoliu | 70 |
+--------------+-----------+------+--+

The click_log table:

create table click_log(
cookie_id string,
ad_id string,
time string
) row format delimited fields terminated by '\t';

Load the data:

load data local inpath '/home/hadoop/data/click_log.txt' into table click_log;

+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id | click_log.ad_id | click_log.time |
+----------------------+------------------+-----------------------------+--+
| 11 | ad_101 | 2014-05-01 06:01:12.334+01 |
| 22 | ad_102 | 2014-05-01 07:28:12.342+01 |
| 33 | ad_103 | 2014-05-01 07:50:12.33+01 |
| 11 | ad_104 | 2014-05-01 09:27:12.33+01 |
| 22 | ad_103 | 2014-05-01 09:03:12.324+01 |
| 33 | ad_102 | 2014-05-02 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-02 09:07:12.344+01 |
| 35 | ad_105 | 2014-05-03 11:07:12.339+01 |
| 22 | ad_104 | 2014-05-03 12:59:12.743+01 |
| 77 | ad_103 | 2014-05-03 18:04:12.355+01 |
| 99 | ad_102 | 2014-05-04 00:36:39.713+01 |
| 33 | ad_101 | 2014-05-04 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-05 09:07:12.344+01 |
| 35 | ad_102 | 2014-05-05 11:07:12.339+01 |
| 22 | ad_103 | 2014-05-05 12:59:12.743+01 |
| 77 | ad_104 | 2014-05-05 18:04:12.355+01 |
| 99 | ad_105 | 2014-05-05 20:36:39.713+01 |
+----------------------+------------------+-----------------------------+--+

The ad_list table:

create table ad_list(
ad_id string,
url string,
catalogs string
) row format delimited fields terminated by '\t';

Load the data:

load data local inpath '/home/hadoop/data/ad_list.txt' into table ad_list;

INFO : OK
+----------------+------------------------+--------------------------------------+--+
| ad_list.ad_id | ad_list.url | ad_list.catalogs |
+----------------+------------------------+--------------------------------------+--+
| ad_101 | http://www.google.com | catalog8|catalog1 |
| ad_102 | http://www.sohu.com | catalog6|catalog3 |
| ad_103 | http://www.baidu.com | catalog7 |
| ad_104 | http://www.qq.com | catalog5|catalog1|catalog4|catalog9 |
| ad_105 | http://sina.com | NULL |
+----------------+------------------------+--------------------------------------+--+

Analyzing click_log together with ad_list

select * from click_log;
+----------------------+------------------+-----------------------------+--+
| click_log.cookie_id | click_log.ad_id | click_log.time |
+----------------------+------------------+-----------------------------+--+
| 11 | ad_101 | 2014-05-01 06:01:12.334+01 |
| 22 | ad_102 | 2014-05-01 07:28:12.342+01 |
| 33 | ad_103 | 2014-05-01 07:50:12.33+01 |
| 11 | ad_104 | 2014-05-01 09:27:12.33+01 |
| 22 | ad_103 | 2014-05-01 09:03:12.324+01 |
| 33 | ad_102 | 2014-05-02 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-02 09:07:12.344+01 |
| 35 | ad_105 | 2014-05-03 11:07:12.339+01 |
| 22 | ad_104 | 2014-05-03 12:59:12.743+01 |
| 77 | ad_103 | 2014-05-03 18:04:12.355+01 |
| 99 | ad_102 | 2014-05-04 00:36:39.713+01 |
| 33 | ad_101 | 2014-05-04 19:10:12.343+01 |
| 11 | ad_101 | 2014-05-05 09:07:12.344+01 |
| 35 | ad_102 | 2014-05-05 11:07:12.339+01 |
| 22 | ad_103 | 2014-05-05 12:59:12.743+01 |
| 77 | ad_104 | 2014-05-05 18:04:12.355+01 |
| 99 | ad_105 | 2014-05-05 20:36:39.713+01 |
+----------------------+------------------+-----------------------------+--+

Get the distinct ad_id values each cookie visited:

select cookie_id,collect_set(ad_id) from click_log group by cookie_id;

INFO : OK
+------------+-------------------------------+--+
| cookie_id | _c1 |
+------------+-------------------------------+--+
| 11 | ["ad_101","ad_104"] |
| 22 | ["ad_102","ad_103","ad_104"] |
| 33 | ["ad_103","ad_102","ad_101"] |
| 35 | ["ad_105","ad_102"] |
| 77 | ["ad_103","ad_104"] |
| 99 | ["ad_102","ad_105"] |
+------------+-------------------------------+--+

If you don't want deduplication, use the collect_list() function instead:

select cookie_id,collect_list(ad_id) from click_log group by cookie_id;

+------------+----------------------------------------+--+
| cookie_id | _c1 |
+------------+----------------------------------------+--+
| 11 | ["ad_101","ad_104","ad_101","ad_101"] |
| 22 | ["ad_102","ad_103","ad_104","ad_103"] |
| 33 | ["ad_103","ad_102","ad_101"] |
| 35 | ["ad_105","ad_102"] |
| 77 | ["ad_103","ad_104"] |
| 99 | ["ad_102","ad_105"] |
+------------+----------------------------------------+--+

Count each cookie's visits per ad:

select cookie_id, ad_id, count(1) from click_log group by cookie_id, ad_id;

+------------+---------+------+--+
| cookie_id | ad_id | _c2 |
+------------+---------+------+--+
| 11 | ad_101 | 3 |
| 11 | ad_104 | 1 |
| 22 | ad_102 | 1 |
| 22 | ad_103 | 2 |
| 22 | ad_104 | 1 |
| 33 | ad_101 | 1 |
| 33 | ad_102 | 1 |
| 33 | ad_103 | 1 |
| 35 | ad_102 | 1 |
| 35 | ad_105 | 1 |
| 77 | ad_103 | 1 |
| 77 | ad_104 | 1 |
| 99 | ad_102 | 1 |
| 99 | ad_105 | 1 |
+------------+---------+------+--+

On top of that, add the category: join against ad_list to pull in the catalogs for each ad_id:

select click.cookie_id,click.ad_id,click.amount,ad_list.catalogs
from
(select cookie_id,ad_id ,count(1) amount from click_log group by cookie_id,ad_id) click
join ad_list
on ad_list.ad_id = click.ad_id;

+------------------+--------------+---------------+--------------------------------------+--+
| click.cookie_id | click.ad_id | click.amount | ad_list.catalogs |
+------------------+--------------+---------------+--------------------------------------+--+
| 11 | ad_101 | 3 | catalog8|catalog1 |
| 11 | ad_104 | 1 | catalog5|catalog1|catalog4|catalog9 |
| 22 | ad_102 | 1 | catalog6|catalog3 |
| 22 | ad_103 | 2 | catalog7 |
| 22 | ad_104 | 1 | catalog5|catalog1|catalog4|catalog9 |
| 33 | ad_101 | 1 | catalog8|catalog1 |
| 33 | ad_102 | 1 | catalog6|catalog3 |
| 33 | ad_103 | 1 | catalog7 |
| 35 | ad_102 | 1 | catalog6|catalog3 |
| 35 | ad_105 | 1 | NULL |
| 77 | ad_103 | 1 | catalog7 |
| 77 | ad_104 | 1 | catalog5|catalog1|catalog4|catalog9 |
| 99 | ad_102 | 1 | catalog6|catalog3 |
| 99 | ad_105 | 1 | NULL |
+------------------+--------------+---------------+--------------------------------------+--+

Exploding the catalogs column into rows

ad_101 | catalog8|catalog1 becomes:

ad_101 |catalog1
ad_101 |catalog8

select ad_id,catalog from ad_list lateral view outer explode(split(catalogs,'\\|')) t as catalog;
+---------+-----------+--+
| ad_id | catalog |
+---------+-----------+--+
| ad_101 | catalog8 |
| ad_101 | catalog1 |
| ad_102 | catalog6 |
| ad_102 | catalog3 |
| ad_103 | catalog7 |
| ad_104 | catalog5 |
| ad_104 | catalog1 |
| ad_104 | catalog4 |
| ad_104 | catalog9 |
| ad_105 | NULL |
+---------+-----------+--+

If you drop the outer keyword from the query above, the trailing NULL row disappears as well.
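The same query without outer, as a quick check; ad_105 has NULL catalogs, so explode() produces nothing for it and the row is dropped entirely:

select ad_id,catalog from ad_list lateral view explode(split(catalogs,'\\|')) t as catalog;
-- same result as above, minus the "| ad_105 | NULL |" row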

Requirement: sort the values inside the catalogs column below:

+---------+--------------------------------------+--+
| ad_id | catalogs |
+---------+--------------------------------------+--+
| ad_101 | catalog8|catalog1 |
| ad_102 | catalog6|catalog3 |
| ad_103 | catalog7 |
| ad_104 | catalog5|catalog1|catalog4|catalog9 |
| ad_105 | NULL |
+---------+--------------------------------------+--+

Convert the column to an array, then sort it.
Reload the table with catalogs as an array:

create table ad_list_2(ad_id string,url string,catalogs array<string>)row format 
delimited fields terminated by '\t' COLLECTION ITEMS TERMINATED BY '|';

Load the data:

load data local inpath '/home/hadoop/data/ad_list.txt' into table ad_list_2;


+------------------+------------------------+------------------------------------------------+--+
| ad_list_2.ad_id | ad_list_2.url | ad_list_2.catalogs |
+------------------+------------------------+------------------------------------------------+--+
| ad_101 | http://www.google.com | ["catalog8","catalog1"] |
| ad_102 | http://www.sohu.com | ["catalog6","catalog3"] |
| ad_103 | http://www.baidu.com | ["catalog7"] |
| ad_104 | http://www.qq.com | ["catalog5","catalog1","catalog4","catalog9"] |
| ad_105 | http://sina.com | NULL |
+------------------+------------------------+------------------------------------------------+--+

Result:

select ad_id,url,sort_array(catalogs) from ad_list_2;
INFO : OK
+---------+------------------------+------------------------------------------------+--+
| ad_id | url | _c2 |
+---------+------------------------+------------------------------------------------+--+
| ad_101 | http://www.google.com | ["catalog1","catalog8"] |
| ad_102 | http://www.sohu.com | ["catalog3","catalog6"] |
| ad_103 | http://www.baidu.com | ["catalog7"] |
| ad_104 | http://www.qq.com | ["catalog1","catalog4","catalog5","catalog9"] |
| ad_105 | http://sina.com | NULL |
+---------+------------------------+------------------------------------------------+--+

Date and Time Functions

Check the current time:
select current_timestamp from dual;
INFO : OK

+--------------------------+
| _c0 |
+--------------------------+
| 2017-09-06 10:56:56.992 |
+--------------------------+

Get the number of seconds since 1970-01-01 (the Unix epoch):

select unix_timestamp() from dual;

+-------------+
| _c0 |
+-------------+
| 1504666616 |
+-------------+

unix_timestamp converts a time to the number of seconds since 1970-01-01:

desc function extended unix_timestamp;
INFO : OK
+--------------------------------------------------------------------------------------------+
| tab_name |
+--------------------------------------------------------------------------------------------+
| unix_timestamp(date[, pattern]) - Converts the time to a number |
| Converts the specified time to number of seconds since 1970-01-01. The unix_timestamp(void) overload is deprecated, use current_timestamp. |
+--------------------------------------------------------------------------------------------+
select unix_timestamp('2017-09-06 10:56:56') from dual;
INFO : OK
+-------------+
| _c0 |
+-------------+
| 1504666616 |
+-------------+
select unix_timestamp('20170201 20:12:12','yyyyMMdd HH:mm:ss') from dual;

INFO : OK
+-------------+
| _c0 |
+-------------+
| 1485951132 |
+-------------+

from_unixtime converts a given number of epoch seconds into a date string in the specified format:

desc function extended from_unixtime;
+--------------------------------------------------------------------------------+
| tab_name |
+--------------------------------------------------------------------------------+
| from_unixtime(unix_time, format) - returns unix_time in the specified format |
| Example: |
| > SELECT from_unixtime(0, 'yyyy-MM-dd HH:mm:ss') FROM src LIMIT 1; |
| '1970-01-01 00:00:00' |
+--------------------------------------------------------------------------------+
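A quick round-trip sketch using the epoch value from the unix_timestamp example above (the output assumes the same session time zone):

select from_unixtime(1504666616, 'yyyy-MM-dd HH:mm:ss') from dual;
-- 2017-09-06 10:56:56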

String to date: to_date

desc function extended to_date;	
INFO : OK
+--------------------------------------------------------------------------------+
| tab_name |
+--------------------------------------------------------------------------------+
| to_date(expr) - Extracts the date part of the date or datetime expression expr |
| Example: |
| > SELECT to_date('2009-07-30 04:17:52') FROM src LIMIT 1; |
| '2009-07-30' |
+--------------------------------------------------------------------------------+
select to_date('2022-12-12 12:45:12') from dual;
INFO : OK
+-------------+
| _c0 |
+-------------+
| 2022-12-12 |
+-------------+
Year:
select year('2012-12-22') from dual;
INFO : OK
+-------+
| _c0 |
+-------+
| 2012 |
+-------+

Month:
select month('2022-12-12 12:45:12') from dual;
INFO : OK
+------+
| _c0 |
+------+
| 12 |
+------+
The other parts work the same way: day(), hour(), minute(), second().
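A minimal sketch of those extractors on the same timestamp (outputs follow the standard semantics):

select day('2022-12-12 12:45:12'), hour('2022-12-12 12:45:12'), minute('2022-12-12 12:45:12'), second('2022-12-12 12:45:12') from dual;
-- 12   12   45   12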

Date arithmetic: date_add & date_sub

date_add(start_date, num_days) - Returns the date that is num_days after start_date.
date_sub(start_date, num_days) - Returns the date that is num_days before start_date.
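For example (a sketch; the dates are computed by hand from the definitions above):

select date_add('2017-09-06', 7), date_sub('2017-09-06', 7) from dual;
-- 2017-09-13   2017-08-30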

Type conversion:

cast(value as type)
select cast('10' as int) from dual;
select cast(current_timestamp as date) from dual;

Note: a cast that cannot be performed returns NULL.

binary can only be cast directly to string.
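A small sketch of both behaviors (the outputs assume standard Hive cast semantics):

select cast('10' as int), cast('abc' as int) from dual;
-- 10   NULL   (the second cast cannot be performed, so it returns NULL)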


Math Functions

round()      round half up (optionally to d decimal places)
ceil()       round up
ceiling()    alias of ceil()
floor()      round down
abs()        absolute value
least()      smallest of the arguments
greatest()   largest of the arguments
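A one-line sketch covering all of them (outputs follow the standard semantics):

select round(3.456, 2), ceil(3.1), floor(3.9), abs(-5), least(3, 1, 2), greatest(3, 1, 2) from dual;
-- 3.46   4   3   5   1   3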

String functions:

substr
substring
concat
concat_ws
select concat_ws(".","192","168","52","45") from dual;
INFO : OK
+----------------+
| _c0 |
+----------------+
| 192.168.52.45 |
+----------------+

split
select split("192.168.52.100","\\.") from dual;
INFO : OK
+---------------------------+
| _c0 |
+---------------------------+
| ["192","168","52","100"] |
+---------------------------+
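substr/substring and concat from the list above have no examples, so here is a sketch (results follow the standard semantics):

select substr('hive_rocks', 1, 4), concat('192', '.', '168') from dual;
-- hive   192.168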

JSON String Operations

create table rating(json string);
-- after loading the raw JSON lines:
select * from rating;
INFO : OK
+-------------------------------------------------------------+
| rating.json |
+-------------------------------------------------------------+
| {"movie":"1193","rate":"5","time":"978300760","userid":"1"} |
| {"movie":"661","rate":"3","time":"978300761","userid":"1"} |
| {"movie":"914","rate":"3","time":"978300762","userid":"1"} |
| {"movie":"325","rate":"4","time":"978300763","userid":"1"} |
| {"movie":"1443","rate":"3","time":"978300760","userid":"1"} |
| {"movie":"1893","rate":"2","time":"978301760","userid":"1"} |
| {"movie":"1153","rate":"3","time":"978302760","userid":"1"} |
| {"movie":"1893","rate":"1","time":"978303760","userid":"1"} |
+-------------------------------------------------------------+

The json_tuple function

desc function extended json_tuple;

+---------------------------------------------------------------------------------------------+
| tab_name |
+---------------------------------------------------------------------------------------------+
| json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string. |
+---------------------------------------------------------------------------------------------+
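For a single field, the get_json_object function mentioned in that description also works; it takes a JSONPath-style expression (a sketch):

select get_json_object(json, '$.movie') from rating;
-- one string column: 1193, 661, 914, ...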

Usage example:

select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating;

INFO : OK
+--------+-------+------------+---------+
| movie | rate | time | userid |
+--------+-------+------------+---------+
| 1193 | 5 | 978300760 | 1 |
| 661 | 3 | 978300761 | 1 |
| 914 | 3 | 978300762 | 1 |
| 325 | 4 | 978300763 | 1 |
| 1443 | 3 | 978300760 | 1 |
| 1893 | 2 | 978301760 | 1 |
| 1153 | 3 | 978302760 | 1 |
| 1893 | 1 | 978303760 | 1 |
+--------+-------+------------+---------+
select userid,movie,rate,year(from_unixtime(cast(time as bigint))) as date from (select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating) t;
+---------+--------+-------+-------+
| userid | movie | rate | date |
+---------+--------+-------+-------+
| 1 | 1193 | 5 | 2001 |
| 1 | 661 | 3 | 2001 |
| 1 | 914 | 3 | 2001 |
| 1 | 325 | 4 | 2001 |
| 1 | 1443 | 3 | 2001 |
| 1 | 1893 | 2 | 2001 |
| 1 | 1153 | 3 | 2001 |
| 1 | 1893 | 1 | 2001 |
+---------+--------+-------+-------+
-- This first version fails: json_tuple returns strings, and strings do not implicitly convert to the bigint that from_unixtime expects
select userid,movie,rate,to_date(from_unixtime(time, 'yyyy-MM-dd HH:mm:ss')) as date from (select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating) t;

-- Cast time to bigint first:
select userid,movie,rate,to_date(from_unixtime(cast(time as bigint), 'yyyy-MM-dd HH:mm:ss')) as datess from (select json_tuple(json,'movie','rate','time','userid') as (movie,rate,time,userid) from rating) t;
+---------+--------+-------+-------------+
| userid | movie | rate | datess |
+---------+--------+-------+-------------+
| 1 | 1193 | 5 | 2001-01-01 |
| 1 | 661 | 3 | 2001-01-01 |
| 1 | 914 | 3 | 2001-01-01 |
| 1 | 325 | 4 | 2001-01-01 |
| 1 | 1443 | 3 | 2001-01-01 |
| 1 | 1893 | 2 | 2001-01-01 |
| 1 | 1153 | 3 | 2001-01-01 |
| 1 | 1893 | 1 | 2001-01-01 |
+---------+--------+-------+-------------+

The URL parsing function parse_url_tuple

desc function extended parse_url_tuple;
+------------------------------------------------------------------------------------------------------------------------+
| tab_name |
+------------------------------------------------------------------------------------------------------------------------+
| parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL. |
| It takes a URL and one or multiple partnames, and returns a tuple. All the input parameters and output column types are string. |
| Partname: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY_NAME> |
| Note: Partnames are case-sensitive, and should not contain unnecessary white spaces. |
| Example: |
| > SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1; |
| > SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a; |
+------------------------------------------------------------------------------------------------------------------------+
select parse_url_tuple("http://www.baidu.com/big/webpage?cookieid=10","HOST","PATH","QUERY") from dual;

INFO : OK
+----------------+---------------+--------------+
| c0 | c1 | c2 |
+----------------+---------------+--------------+
| www.baidu.com | /big/webpage | cookieid=10 |
+----------------+---------------+--------------+
select parse_url_tuple("http://www.baidu.com/big/webpage?cookieid=10","HOST","PATH","QUERY","QUERY:cookieid") from dual;

INFO : OK
+----------------+---------------+--------------+-----+
| c0 | c1 | c2 | c3 |
+----------------+---------------+--------------+-----+
| www.baidu.com | /big/webpage | cookieid=10 | 10 |
+----------------+---------------+--------------+-----+

Other Functions

Checking whether a value is NULL

select ename,job,comm,isnull(comm) isnull,isnotnull(comm) isnotnull from emp;

INFO : OK
+---------+------------+---------+---------+------------+
| ename | job | comm | isnull | isnotnull |
+---------+------------+---------+---------+------------+
| SMITH | CLERK | NULL | true | false |
| ALLEN | SALESMAN | 300.0 | false | true |
| WARD | SALESMAN | 500.0 | false | true |
| JONES | MANAGER | NULL | true | false |
| MARTIN | SALESMAN | 1400.0 | false | true |
| BLAKE | MANAGER | NULL | true | false |
| CLARK | MANAGER | NULL | true | false |
| SCOTT | ANALYST | NULL | true | false |
| KING | PRESIDENT | NULL | true | false |
| TURNER | SALESMAN | 0.0 | false | true |
| ADAMS | CLERK | NULL | true | false |
| JAMES | CLERK | NULL | true | false |
| FORD | ANALYST | NULL | true | false |
| MILLER | CLERK | NULL | true | false |
| HIVE | PROGRAM | NULL | true | false |
+---------+------------+---------+---------+------------+

The assert_true function

desc function extended assert_true;
INFO : OK
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| assert_true(condition) - Throw an exception if 'condition' is not true. |
| Example: |
| > SELECT assert_true(x >= 0) FROM src LIMIT 1; |
| NULL |
+----------------------------------------------------+

The elt() function

desc function extended elt;

INFO : OK
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| elt(n, str1, str2, ...) - returns the n-th string |
| Example: |
| > SELECT elt(1, 'face', 'book') FROM src LIMIT 1;|
| 'face' |
+----------------------------------------------------+
select elt(2,"hello","hive","spark") from dual;
INFO : OK
+-------+
| _c0 |
+-------+
| hive |
+-------+

Handling NULLs with NVL

desc function extended nvl;

INFO : OK
+----------------------------------------------------+
| tab_name |
+----------------------------------------------------+
| nvl(value,default_value) - Returns default value if value is null else returns value |
| Example: |
| > SELECT nvl(null,'bla') FROM src LIMIT 1; |
| bla |
+----------------------------------------------------+
select ename,comm,nvl(comm,'sss') from emp;
INFO : OK
+---------+---------+---------+
| ename | comm | _c2 |
+---------+---------+---------+
| SMITH | NULL | sss |
| ALLEN | 300.0 | 300.0 |
| WARD | 500.0 | 500.0 |
| JONES | NULL | sss |
| MARTIN | 1400.0 | 1400.0 |
| BLAKE | NULL | sss |
| CLARK | NULL | sss |
| SCOTT | NULL | sss |
| KING | NULL | sss |
| TURNER | 0.0 | 0.0 |
| ADAMS | NULL | sss |
| JAMES | NULL | sss |
| FORD | NULL | sss |
| MILLER | NULL | sss |
| HIVE | NULL | sss |
+---------+---------+---------+

Summary

Hive's Array type:
When creating the table, declare the field as an array: location array<string>
Specify the separator between array items: COLLECTION ITEMS TERMINATED BY ','

Retrieving data:
Index into the field directly by name: location[0]
Get the number of elements in the array: size(location)
Test whether the array contains a value: array_contains(location,'shanghai')

Hive's Map type:
Create table: declare the Map type: members map<string,string>
Specify the separator between entries and between key and value:
collection items terminated by '#'
map keys terminated by ':';

Retrieving data:
Get the value for a given key: members['father']
Get all the keys: map_keys(members)
Get all the values: map_values(members)
Count the key-value pairs: size(members)

Hive's Struct type:
When creating the table, declare the field as a struct: info struct<name:string,age:int>
Specify the separator between struct members: COLLECTION ITEMS TERMINATED BY ':'
Retrieve data with dot notation on the field name, e.g. info.age

Column-to-rows function:
lateral view outer explode(split(catalogs,'\\|')) t as catalog
Sorting Array data:
select ad_id,url,sort_array(catalogs) from ad_list_2;